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Abstract 

Background: The comparative modeling approach to protein structure prediction inherently relies on a template 
structure. Before building a model such a template protein has to be found and aligned with the query sequence. Any 
error made on this stage may dramatically affects the quality of result. There is a need, therefore, to develop accurate 
and sensitive alignment protocols. 

Results: BioShell threading software is a versatile tool for aligning protein structures, protein sequences or sequence 
profiles and query sequences to a template structures. The software is also capable of sub-optimal alignment 
generation. It can be executed as an application from the UNIX command line, or as a set of Java classes called from a 
script or a Java application. The implemented Monte Carlo search engine greatly facilitates the development and 
benchmarking of new alignment scoring schemes even when the functions exhibit non-deterministic 
polynomial-time complexity. 

Conclusions: Numerical experiments indicate that the new threading application offers template detection abilities 
and provides much better alignments than other methods. The package along with documentation and examples is 
available at: http : //bioshell .pl/threading3d. 



Background 

Protein structure prediction has become one of the key 
tasks in computational biology of the post-genomic era. 
Due to the growing size of structural databases, the most 
important and widely used method is homology model- 
ing. This methodology relies on the existence of structures 
of homologous protein(s) in databases. The major parts 
of this procedure are i) recognition of homology between 
two proteins and ii) correct alignment for the pair of two 
proteins for which homology was recognized. Here we 
focus on the latter, still challenging problem. Accurate 
alignment is essential for many state-of-the-art 3D protein 
structure prediction algorithms [1-4]. The development 
of novel threading algorithms however is hindered by i) 
lack of a general consensus on scoring schemes and ii) 
plethora of different variants of the same scoring function 
described in literature but not available as a ready-to-use 
software. 
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Our contribution presents a versatile tool for the fast 
and extensive aligning of two proteins with each other. 
The alignment can be based on i) the two sequences, ii) 
one sequence and one structure or Hi) on the two struc- 
tures. The first case, corresponding to pairvise sequence 
alignment is trivial and can be solved by dynamic pro- 
gramming. However, the other two (corresponding to 
3D threading and structure alignment, respectively) are 
NP-hard problems [5]. Our novel object-oriented applica- 
tion, incorporated within the BioShell package [6,7], is an 
integrated framework to heuristically tackle these protein- 
to-protein alignment problems. The application is written 
in Java language which facilitates its easy use on various 
systems and architectures. The advantages and novelties 
of the software over the existing and downloadable ones 
[8-12] are: 

i) it employs Monte Carlo (MC) to sample the 
alignment space so an approximate solution to 
NP-hard 3D threading problem can be found, 

ii) each scoring term type is a separate object that can be 
easily switched on/off and fully customized with user- 
provided data, e.g. in a single run, several secondary 
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structure similarity scores may be used, each of them 
based on a different secondary structure prediction, 

Hi) new potentials and scoring schemes can be easily 
implemented by users, 

iv) as a result, the user obtains the best scoring solution 
and a number of suboptimal alignments, ranked by 
their score; the alignments can be outputted in the 
Modeller [1] format file and easily used to build final 
model structures. 

v) it can be used as a structure alignment software, also 
capable of producing suboptimal structure 
alignments. 

vi) it can read in and score any arbitrary alignment 
provided by the user. This can be very helpful in the 
manual refinement of alignments or for threading 
force field development. 

The project website provides extensive documentation 
of the library (API) and numerous examples which show 
how to run the executable threading application and how 
to interact with the software library. 

Implementation 

The source code was divided into four main blocks: 
i) encoding alignment as system coordinates, ii) 
moves (alignment modification), Hi) scoring and iv) 
gathering results. Each of these components forms 
a separate sub-package in the source code tree: 
jbel . simulations . threading, jbel . simula- 
tions . threading . movers, jbel . simulations . 
threading. ff and jbel . simulations . thread- 
ing . observers, respectively. These routines are sup- 
ported by other generic BioShell components such as 
Monte Carlo sampling and I/O operations. For users 
convenience we provide also a stand-alone application. 
To run calculations, the user specifies: i) input data, ii) 
modification scheme and the of MC sampling and Hi) 
scoring function (force field). 

Alignment representation 

Protein-to-protein alignment between query (Q) and tem- 
plate (T) proteins is defined as a list of blocks (see Figure 1 
for an example). Every block represents a gapless align- 
ment stretch between a query and a template sequence 
[13]. An i th block is of the length Lu where the frag- 
ment starts at the I th position in the query sequence and 
at the J th position in the template sequence and ends in 
/ + Li and / + Li positions in the query and the template 
sequences respectively. In the course of code optimization 
we introduced two restrictions into the program. The first 
one states that any single block cannot be shorter than 
the MIN.BLOCK LENGTH which by default is set to 4. 
Making this value smaller (which can be easily done with 
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Figure 1 Graphical representation of an alignment in BioShell- 
threading. Portrayed alignment (drawn on an alignment matrix 
between a query and a template) consists of 4 aligned blocks. Red dot 
is an alignment of \ th query's residue to k th template's residue. Blue 
dot is an alignment of } th query's residue to k th template's residue. 



a command line option) results in considerably higher 
computational cost but occasionally might lead to better 
alignments. The second rule states that an alignment must 
consist of at least the MIN_NUMBER_OF_BLOCKS, by 
default equal to 4. This second rule is only a technical 
trick that gives all movers (see below) a chance to be exe- 
cuted successfully and thus make the sampling process 
more effective. Since we do not require two neighboring 
blocks to be separated by a gap, it is always possible to 
represent a long alignment block as concatenation of sev- 
eral shorter ones. For instance, a perfect alignment of a 
100 amino acid sequence with itself may be defined as 
25 blocks of 4 residues each, by 4 blocks of 25 residues, 
or by many other combinations of blocks which are not 
separated by gaps. All of them however lead to the same 
alignment and have the same score. Thus the second rule 
does not limit the sampled conformational space. More- 
over, both restricting parameters may be changed by the 
user from the command line. 

Alignment sampling 

A list of blocks defines a point in the conformational space 
of all possible sequence alignments between two proteins. 
Sampling of this space is performed by a set of movers i.e. 
objects that attempt to modify an alignment. The follow- 
ing seven types of movers have been implemented so far: 
Shift a block, Shrink/Expand a block, Merge two blocks, 
Split one block into two new blocks, Jump part of the one 
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block to a neighbor block, Create a new block and Anni- 
hilate a block. The move types have been schematically 
depicted in the Figure 2. The background grid represents 
a classic Dynamic Programming (DP) matrix; solid and 
dashed lines denote an alignment before and after a move, 
respectively. BlockShift shifts a block horizontally and/or 
vertically on the DP matrix with a uniform distribution 
within the allowed space (blocks cannot overlap). Block- 
Shrink & BlockExpand can shrink or expand a block on 
either end within the allowed space. A block cannot shrink 
to a length shorter than the MIN_BLOCK_LENGTH. 
The length N of shrinking/expanding is generated with 
1/2 N distribution. Bio ckPart Jump performs a jump of 
a part of a block to a neighboring block. The length of 
the jumping part was generated with uniform distribu- 
tion which does not violate the MIN_BLOCK_LENGTH. 
BlockSplit can split a block into two parts with con- 
servation of the MIN_BLOCK_LENGTH for both of 
the two parts. The split is performed with a uniform 
distribution. BlocksMerge can merge two neighboring 
blocks when possible. Merging continues as long as 
the MIN_NUMBER_OF_BLOCKS is fulfilled. BlockAn- 
nihilate can annihilate one of the alignment blocks if 
the MIN_NUMBER_OF_BLOCKS is not violated. Finally, 
BlockCreate creates a new block. The user can define 
how often each of the move types is attempted. Once 
a mover has been executed and an alignment modified, 
the new conformation is accepted (or rejected) accord- 
ing to the Metropolis criteria [14]. The sampling process 
is governed either by Simulated Annealing (SA) [15] or 
by Replica Exchange Monte Carlo (REMC) [16]. The lat- 
ter offers a very effective way to sample the conforma- 
tional space of all possible alignments. As an example 
(see Figure 3), we show energy distributions obtained by a 
REMC search comprising ten replicas running at 10 dis- 
tinct temperatures: 0.08, 0.1, 0.15, 0.2, 0.5, 1.0, 2.0, 5.0, 7.0, 
10.0. The distributions exhibit large overlap with neigh- 
boring replicas which facilitates random walk in the tem- 
perature space and results in highly enhanced sampling. 
In this particular example 2pcyA query chain was mod- 
elled on 2azaA template with (-TMScore) as the energy 
function. However, we found this set of temperatures to 
be very universal and working very efficiently for various 
scoring schemes and different protein lengths. There- 
fore REMC with these temperature settings (if not stated 
otherwise) was used for all the numerical experiments 
described in this contribution. 

Alignment scoring 

Each particular alignment is assessed by a scoring func- 
tion (force field). This function is defined as weighted 
combination of distinct terms. The scoring terms imple- 
mented in the package can be divided into six cate- 
gories: (i) structure based scores, such as TM-score [17] 
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Figure 2 Different types of moves in BioShell-Threading 
software. A) Block Shift B) Shrink/Expand C) Jump the part of the 
Block D&E) Split/Merge Blocks F) Annihilating of Block. For every 
simulation in this contribution, probability of performing a move by 
mover was with ratio: 1 00 : 1 00 : 1 00 : 1 0 : 1 0 : 1 for A : B : C : D : E : F, 
respectively. The dashed lines represent moved (panel A) and resized 
(panels B and C) alignment fragments. 



or cRMSD, (ii) contact potentials such as Miyazawa- 
Jerningan [18,19], (Hi) environmental potentials, e.g. sol- 
vent accessibility score (iv) sequence profile similarity 
measures, (v) secondary structure similarity measures and 
(vi) gap penalty functions. The full list of available scor- 
ing methods is provided in Table 1. Group (i) of scores 
require the query structure to be provided and can be 
used either for benchmarking purposes (when the query 
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structure is actually known) or for a structure-to-structure 
Monte Carlo (MC) alignment. Group (ii) scores transform 
a template contact map onto a query sequence based on 
a current alignment. Finally, scores from groups (Hi), (iv), 
and (v) use various kinds of profiles: sequence profiles, 
predicted propensity for a certain secondary structure 
type, predicted solvent exposure level etc. 

An inheritance diagram depicting basic relationships 
between the classes is shown in the Figure 4, in which 
each rectangle denotes an abstract interface and a box 
with rounded corners - an implemented score type. All the 
score types are derived from AlignmentEnergy. Some 
of them also inherit from ByAtomEnergy which means 
that the score value may be decomposed into a sum over 
all aligned residue pairs. SubstitutionScore based 
on a substitution matrix such as the BLOSSUM [36] or 
PAM [37] matrix is obviously a ByAtomEnergy example, 
while RMSDAlignScore cannot be implemented in this 
manner. ByAtomEnergy ability is very important for the 
program as it enables the calculate energy change without 
the evaluating whole alignment. One a mover has altered 
an alignment, it returns a list of alignment columns (i.e. 
the query residue indexes) that have been affected. Then 



Score type 


Description 




Al ignemntEnergy 


Base class for all alignment scores 




ByAt omEne rgy 


Score depends solely on a single position in a target and the aligned position in 
a template 


StructureBasedScore 


Knows about target and template atomic coordinates 




BigMatrixEnergy 


Pairwise per-position energy may be pre-calculated and stored in a 2D array 


ContactBasedEnergy 


Energy that depends on a template contact map 




PairwiseCont act Energy 


Contact based energy that is pairwise-decomposable, but can't be precalculated 
(otherwise it would result in BigMatrixEnergy score) 


Score name 


Derived from 


References 


DaliScore 


StructureBasedScore 


[20] 


RMSDAlignScore 


StructureBasedScore 


[21] 


TMAlignScore 


StructureBasedScore 


[17] 


GoLikeScore 


ContactBased Potential 


[22,23] 


TwoBodyContact 


ContactBasedPotential 


[18,19,24-27] 


P2PScore 


ContactBasedPotential 


[28] 


StrcGapPenalty 


ContactBasedPotential 


[29] 


Af f ineGap 


Al ignmentEnergy 


[30] 


AsaScore 


BigMatrixEnergy 


[30] 


EnvScore 


BigMatrixEnergy 


[31] 


GravyScore 


BigMatrixEnergy 


[32] 


ProbabilisticSecondaryScore 


BigMatrixEnergy 


[33] 


Prof ileScore 


BigMatrixEnergy 


[34,35] 


SubstitutionScore 


BigMatrixEnergy 


[36] 



=s 7.5 



5.0 




-0.7 



-0.6 



-0.5 -0.4 -0.3 
E = -tmscore 



-0.2 



-0.1 



Figure 3 Energy distributions overlap for 1 0 replicas in 1 0 
different temperatures. Counts for every replica are depicted as a 
logarithm. The temperatures vary from 0.08 to 1 0 dimensionless units. 
A significant overlap between replicas suggest that the number and 
temperatures distribution are chosen correctly. 



Table 1 Potentials implemented within BioShell-threading 
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ByAtomEnergy 



calculateByAtomEnergy ( ) 



T 



AlignmentEnergy 



!-{> BigMatrixEnergy 



calculateEnergy ( ) 
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StructureBasedScore 
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ContactBasedEnergy 
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PairwiseContactEnergy 



-[TMAlignScore 



RMSDAlignScore 



StrGapPenalty 



TwoBodyContact ] 
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ProfileEnergy 
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Figure 4 Energy types inheritance diagram. All interfaces and the 
most important scoring types are shown in the form of UML diagram 
with arrows denoting class inheritance. 



the program evaluates the energy difference at these posi- 
tions only. Further, some of the ByAtomEnergy scores 
can be pre-calculated for every query-template residue 
position once the program is started and stored in a 
2D array. Such scores implement a BigMatrixEnergy 
interface and Subs ti tut ionScore is an example of 
it. An interesting case is StrGapPenalty derived from 
ContactBasedEnergy. Indeed, the penalty for a gap 
introduced into a template structure is assessed based on 
the number of lost contacts. The user can easily imple- 
ment ones own scoring function by extending one of the 
provided abstract classes [30]. 

Results and discussion 

Here we use the application in two real life examples to 
demonstrate the robustness and possible applications of 
the software. The scripts used in the experiments with the 
relevant input data were published on the project website. 

Threading as a structural alignment algorithm 

Threading where both the query and the template pro- 
tein structures are known is equivalent to the protein 
structural alignment. The calculation of such alignments 
is a perfect test for searching strategies. In the Bioshell- 
Threading package there are three scoring functions 
which can be used for this purpose: TM-score, RMSD 
and DALI-score. In this contribution we compare the TM- 
align [17] algorithm with 3D threading in which -TMscore 



[17] was used as an energy function. The benchmark set 
[38] comprises more than 1000 pairs of homologous pro- 
teins. REMC simulation was performed for 10 replicas 
with temperatures distributed from 0.08 to 10.0 dimen- 
sionless units. It can be be seen in the Figure 5A that 
alignments found by REMC search are in most cases 
very close to TM-align results. In a very few cases, how- 
ever, 3D threading can find a significantly better match 
which suggests that the heuristic search implemented in 
TM-align, although very fast, does not always find the 
optimal solution. The calculations were repeated with the 
fragment-based variant of TM-align (fr-TM-align) [39] 
which resolved virtually all of the discrepancies. On the 
other hand, other structural alignment tools such as CE or 
DALI yielded alignments with worse TM-scores (data not 
shown). This was expected since these tools were designed 
to optimize their own Z-scores rather than the TM-score 
parameter. It should be noted that this parameter was 
arbitrarily chosen in this experiment to test sampling effi- 
ciency. Searching with BioShell-Threading also generates 
sub-optimal structural alignments which are often very 
close (within 0.01 TM-score), but may differ significantly 
from the optimal solution (data not shown). 

Quality of query sequence-template structure alignments 

To test the Threading algorithm on more realistic prob- 
lems, the MALIDUP [40] benchmark has been used. 
Results are shown in the Figure 5B. MALIDUP bench- 
mark comprises 241 protein pairs of diverged dupli- 
cated domains. It was chosen because the evolutionary 
relation for the domains under consideration is fairly 
recognized and not biased by sequence similarity. The 
3D-Threading algorithm was compared with four other 
methods: global sequence alignment with the BLOSUM62 
matrix (optimized in [38]), profile-to-profile alignment 
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tmscore by Threading 3D 
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alignment method 



Figure 5 Example results. (A) REMC-Threading algorithm (abscissa) 
as a structure alignment software compared to TM-align (ordinate) on 
more than 1 000 protein pairs. (B) Comparison between BLOSUM62, 
PICASS03, HHAIign, Threading 1 D and REMC-Threading: white bars - 
average TMscore, striped bars - average AI 0 p, dotted bars - average 
Al 4 p alignment quality on MALIDUP set. The best value for all the 
three score types is 1.0. 
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(optimized in [38]) with the PICASS03 [34] scoring func- 
tion, ThreadinglD [38,41] and the widely used, state-of- 
the-art method: HHAlign [10,42,43]. For profile-to-profile, 
threading ID and threading 3D runs sequence profiles 
were generated with five PsiBlast [44] iterations against 
the NR90 database and e-value threshold below 0.00001. 
For the HHalign algorithm, multiple sequence alignments 
(MSA), were created in the local searching mode with 
hhblits [43] on the NR20 database created on January 
10, 2011. Subsequently, these MSAs were used in align- 
ing query and template sequences with hhalign, in the 
global alignment mode. 

The following scoring terms were used: EnvScore, 
ProbabilisticSecondaryScore, PICASS03, Go 
LikeScore, TwoBodyContact with Miyazawa-Jernigan 
contact scoring and StrcGap Penalty. The following 
weights: 0.1, 0.25, 0.5, 1.0, 0.4 and 0.15, respectively were 
optimized on the ProSup [45] dataset. The objective of 
this test was the quality of calculated alignments i.e. 
average TM- score and alignment overlap with manually 
curated alignments. The latter was measured as the per- 
cent of correctly aligned positions ALoP and the fraction 
of aligned positions predicted with an error of at most 
four positions AL4P. For the MALIDUP set it can be 
observed that the threading algorithm, which incorpo- 
rates 3D information from the template structure (column 
R' in the Figure 5B), performs better than the other 
tested algorithms, both in respect of average TM-score 
and overlap with manual alignments (as assessed by ALoP 
and AL4P scores). Profile-based aligners: Threading ID 
and HHAlign perform comparably on these benchmarks, 
whereas BioShell-Threading performs much better. In 
particular, when compared to HHAlign (column 'H'), it 
achieves approximately 0.09 higher ALoP and 0.24 higher 
AL4P. This is partially due to the fact, that in case of 
unrecognized homology, HHAlign returns a null align- 
ment (which affects the averaged score value). There are 
some possible applications of this result. It can be used to 
generate alignment boundaries for protein modeling algo- 
rithms which can use such the information [32,46] . In this 
case the alignment boundary is the range for every query's 
residue to which it can align within the template structure. 
It is also possible, using sub-optimal alignments, to cre- 
ate more diverse spatial constraints for algorithms such as 
Modeller [1]. 

Practical considerations 

The computational approach utilized in this contribution 
is an example of a stochastic simulation rather than a typ- 
ical alignment method. User has to define a number of 
parameters to control this process, such as the number of 
Monte Carlo replicas and the respective set of their tem- 
peratures. Fortunately, several methods have been devised 
for REMC parameters selection, e.g. [32,47]. In general, 



these parameters depend both on query and template pro- 
teins and should be optimized separately for each case. 
However, for the sake of simplicity, for any benchmark cal- 
culation presented in this contribution we used the same 
set of ten replicas as described above. This temperature 
set is wide enough to obtain good results for all the test 
cases but inevitably increases the computational effort. 
Optimization of these parameters might also occasionally 
lead to better alignments. However, even for the optimal 
set of parameters it takes from several minutes to more 
than an hour to reliably sample the low energy area of the 
alignment space. The three-dimensional threading Monte 
Carlo simulation will always be at least an order of magni- 
tude slower than a dynamic programming calculation but 
is usually faster than RAPTOR - another three dimensional 
threading where calulations even for short sequences 
take more than an hour [48]. Raptor method how- 
ever employs branch-and-bound approach and, unlike a 
stochastic simulation, the reach of the global optimum 
of a scoring function is guaranteed. Other parameters a 
user should optimize are: scoring function weights and 
probabilities of particular alignment modifications (i.e. 
moves). The extensive study of this parameter space is 
beyond the scope of this paper and will be described 
elsewhere. In this work, to avoid overtraining, we opti- 
mized the scoring function and movers set on ProSup 
data set [49], which has not been used for benchmarking 
purposes. 

The 3D threading application can also be used as a 
structural alignment method. In the presented bench- 
mark, it has been compared with TMAlign and yielded 
nearly the same results. CPU time required by TMA- 
lign was however about two orders of magnitude shorther 
(minutes to an hour by the threading vs seconds by TMA- 
lign). This result is a direct consequence of the number 
of times each of the two programs calls the TM-score 
evaluation routine. In order to test the convergence of 
the threading, calculations were started from a random 
alignment. During the simulation TM-score has to be 
evaluated at every Monte Carlo move which, unlike scores 
derived from ByAtomEnergy, cannot be recalculated 
locally just for the moved block. TMAlign, on the other 
hand, starts from an alignment computed by a dynamic 
programming procedure and evaluates TM-score a few 
times until convergence is reached. The threading simu- 
lation however provides a numer of different suboptimal 
alignments which all fall within 0.01 range in TM-score 
units. 

Conclusions 

BioShell-Threading implements a three-dimensional pro- 
tein threading algorithm based on a Monte Carlo search 
scheme. The code has been written in Java language which 
makes it virtually machine independent. It implements 
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numerous scoring (energy) functions. Some of them can 
be applied in regular Dynamic Programming. For oth- 
ers, the optimization becomes a NP-hard problem and 
demand more time consuming methods (e.g. MC). The 
package provides a ready to use command-line application 
and a Java software library. This makes BioShell-Threading 
a component that can be very easily incorporated into 
larger protein structure pipelines [50] . By providing subop- 
timal alignments, the package can increase the accuracy of 
widely used protein folding softwares and proteins struc- 
ture classification methods. However, the main goal of 
BioShell-Threading is the refinement of query-to-template 
alignments. At the time, when fold recognition methods 
are fast and quite accurate, alignment accuracy is the lim- 
iting factor. Thus using certain fast algorithms [10,41,44] 
to search the whole protein databases and then refine top 
hits with more sophisticated scoring function seems to be of 
a great value to the protein modeling community. 
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